We present results for the performance of QCD code on ALiCE, the Alpha-Linux Cluster Engine at Wuppertal. 
We describe the techniques employed to optimise the code, including the metaprogramming of assembler kernels, 
the effects of data layout and an investigation into the overheads incurred by the communication. 



1. Introduction 



In a typical lattice QCD project the total run- 
time of code on a supercomputing platform is 
often measured in months or even years. This 
means that even a modest improvement in the 
performance of the code can yield very tangible 
benefits. There are two aspects to the optimisa- 
tion of code for parallel machines: single-node op- 
timisation and the minimisation of the overhead 
incurred by inter-node communications. 

The former requires that the code be written to
take full advantage of the high performance
available from today's advanced hardware. The latter
is of particular importance on cluster machines
like ALiCE, where the scalability of code can be
a serious problem.

2. Single node optimisation 

Experience tells us that the dominant part of a
typical lattice QCD code is the multiplication of
a vector by the fermion matrix, so it is here that
the optimisation effort should be concentrated.
Moreover, hand-coded optimised assembler routines
can dramatically improve performance, since the
programmer can use information about the code
which is unavailable to the compiler.

The disadvantage of assembler routines is
that they are difficult to develop and harder to
maintain, in addition to the obvious lack of
portability. We address these problems by adopting a
metacoding approach: writing a C++ program to
write the assembler code for us. We have developed
special software tools to enable this.

Figure 1. Wilson matrix multiplication in single
(above) and double (below) precision. The vertical
lines indicate the volumes at which the data
fills the level 1 (L1) and level 2 (L2) caches.

2.1. Metacode software toolkit

The first stage in creating the assembler routine
is to reduce the computational task to elementary
assembler-level abstract instructions, e.g. load a
datum from memory into registers, perform
arithmetic on the data, cache management, etc.

In order to write the metacode we have developed
a system of C++ classes and routines which
automatically schedule the instructions to hide
the instruction latencies as much as possible and
automatically manage the register usage.

When the metacode written using these rou- 
tines is compiled and run, the abstract instruc- 
tions with their arguments are translated into an 
actual assembly language and written to a file. 

By basing the toolkit design on an abstract 
RISC ISA it should be possible to produce as- 
sembler code for any RISC machine by changing 
the architecture-dependent parameters. Here we
show the results on ALiCE, a cluster of Compaq
DS10 servers, each with a 616 MHz, 4-way
superscalar Alpha 21264 processor with a 64 KB 2-way
set-associative level 1 (on-chip) data cache and a
2 MB level 2 (off-chip) cache.

2.2. QCD kernels 

An advantage of the metacode toolkit is that it
permits a large degree of flexibility in writing
various assembler kernels; different approaches can
be tried and compared, and the kernels can be
rewritten to adapt to changes in the action or
algorithm.

Figure 1 shows the improvement, over a wide
range of lattice volumes, in the performance of the
Wilson matrix multiplication routine when written
with assembler kernels over that of the original
implementation in C. To demonstrate the effect
in a more realistic environment, the inversion
of the Wilson matrix using BiCGstab is shown in
figure 2.



3. Cluster performance 

ALiCE is clustered using ParaStation 3 over
64-bit/33 MHz Myrinet. Our code uses MPICH
1.2.3 for the communications.

Figure 2. Wilson matrix BiCGstab, comparing
single (dashed) and double (solid) precision.

We test the multinode performance of the
BiCGstab solver on a $16^4$ lattice running on
n = 1, 2, 4, 8 and 16 nodes arranged in a
1-dimensional ($1 \times n$) grid and a 2-dimensional
(square) grid. We use a standard metric of
parallel performance:



$$\mathrm{speedup} = \frac{\text{speed on } n \text{ nodes}}{\text{speed on 1 node}}$$



Our original implementation used a conventional
array ordering for all the fields, where
each lattice site with coordinates $(x_0, x_1, x_2, x_3)$ is
numbered $n = x_3 + N_3 x_2 + N_3 N_2 x_1 + N_3 N_2 N_1 x_0$,
where $N_\mu$ is the size of the local lattice in
direction $\mu$. This is illustrated in figure 4 (left), which
shows that while the data along the boundary in
one direction is contiguous, in the second direction
it is strided. Investigations into the performance
of our MPI communications suggest that
the communication of strided data introduces an
overhead of at least 20% compared to contiguous
data. This explains the poor scaling of the solver
on a 2-dimensional grid shown in figure 3. The
scaling on the 1-dimensional grid suffers from the
increasingly unfavourable surface-to-volume ratio
of the local lattice.
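
The two effects described above can be made concrete with a short C++ sketch.
It assumes the lexicographic numbering given in the text; the names `Lattice`,
`site_index`, `face_indices`, `is_contiguous` and `surface_to_volume_1d` are
ours and purely illustrative. The last function assumes a global lattice of
equal extent in all four directions that is split along one direction only,
with periodic boundaries so that each node exchanges two faces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local lattice extents N[0..3]; direction 3 runs fastest in memory.
struct Lattice {
    int N[4];
};

// Lexicographic site number n = x3 + N3*x2 + N3*N2*x1 + N3*N2*N1*x0.
int site_index(const Lattice& lat, int x0, int x1, int x2, int x3) {
    assert(x0 >= 0 && x0 < lat.N[0] && x1 >= 0 && x1 < lat.N[1]);
    assert(x2 >= 0 && x2 < lat.N[2] && x3 >= 0 && x3 < lat.N[3]);
    return x3 + lat.N[3] * (x2 + lat.N[2] * (x1 + lat.N[1] * x0));
}

// Site numbers of the boundary face x_mu = 0, listed in increasing order.
std::vector<int> face_indices(const Lattice& lat, int mu) {
    assert(mu >= 0 && mu < 4);
    std::vector<int> idx;
    int x[4];
    for (x[0] = 0; x[0] < lat.N[0]; ++x[0])
        for (x[1] = 0; x[1] < lat.N[1]; ++x[1])
            for (x[2] = 0; x[2] < lat.N[2]; ++x[2])
                for (x[3] = 0; x[3] < lat.N[3]; ++x[3])
                    if (x[mu] == 0)
                        idx.push_back(site_index(lat, x[0], x[1], x[2], x[3]));
    return idx;
}

// True if the site numbers form one unbroken run of consecutive integers.
bool is_contiguous(const std::vector<int>& idx) {
    for (std::size_t i = 1; i < idx.size(); ++i)
        if (idx[i] != idx[i - 1] + 1) return false;
    return true;
}

// Face sites exchanged per local site when a lattice of size extent^4 is
// split over `nodes` nodes along one direction: two faces of extent^3 sites
// against a local volume of extent^3 * (extent / nodes).
double surface_to_volume_1d(int extent, int nodes) {
    assert(nodes >= 1 && extent % nodes == 0);
    if (nodes == 1) return 0.0;  // nothing leaves the node
    return 2.0 * nodes / extent;
}
```

For a $4^4$ local lattice the face $x_0 = 0$ occupies site numbers 0 to 63 in
one block, whereas the face $x_3 = 0$ consists of every fourth site. For the
$16^4$ lattice split in one direction, the ratio returned by
`surface_to_volume_1d` doubles each time the number of nodes doubles.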

The solution to these problems appears to be
to rearrange the data layout so that the sites on
the lattice boundaries are ordered in a contiguous
fashion, as illustrated in figure 4 (right).

Figure 3. Speedup of the BiCGstab solver with
the original data layout.

Figure 5. Speedup of the BiCGstab solver with
the new data layout.

Separating the boundary and interior sites in
this way has the additional advantage that
computation can proceed on the interior sites while
the boundary sites are waiting for a non-blocking
communication to finish. Figure 5 shows that
using this new data layout greatly improves the
speedup of the solver. The new data layout does
not adversely affect single node performance.
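
A reordering of this kind can be sketched as a permutation of the
lexicographic site numbers. The C++ below is a simplified illustration under
our own assumptions, not the layout actually used in our code: it treats a
single communicated direction `mu` only, so sites shared between the
boundaries of two directions on a 2-dimensional grid would need further
treatment, and the names `Layout` and `make_layout` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the reordering: interior sites first, then the two faces of
// direction mu, each stored as one contiguous block.
struct Layout {
    std::vector<int> new_of_old;  // new position of each lexicographic site
    int num_interior;             // positions [0, num_interior) are interior
    int lower_face_begin;         // first position of the face x_mu = 0
    int upper_face_begin;         // first position of the face x_mu = N_mu - 1
};

Layout make_layout(const int N[4], int mu) {
    assert(mu >= 0 && mu < 4 && N[mu] >= 2);
    const int volume = N[0] * N[1] * N[2] * N[3];
    const int face = volume / N[mu];

    Layout lay;
    lay.new_of_old.assign(volume, -1);
    lay.num_interior = volume - 2 * face;
    lay.lower_face_begin = lay.num_interior;
    lay.upper_face_begin = lay.num_interior + face;

    int next_interior = 0;
    int next_lower = lay.lower_face_begin;
    int next_upper = lay.upper_face_begin;

    int x[4];
    for (x[0] = 0; x[0] < N[0]; ++x[0])
        for (x[1] = 0; x[1] < N[1]; ++x[1])
            for (x[2] = 0; x[2] < N[2]; ++x[2])
                for (x[3] = 0; x[3] < N[3]; ++x[3]) {
                    // Old (lexicographic) site number, direction 3 fastest.
                    const int old = x[3] + N[3] * (x[2] + N[2] * (x[1] + N[1] * x[0]));
                    if (x[mu] == 0)
                        lay.new_of_old[old] = next_lower++;
                    else if (x[mu] == N[mu] - 1)
                        lay.new_of_old[old] = next_upper++;
                    else
                        lay.new_of_old[old] = next_interior++;
                }
    return lay;
}
```

With such an ordering each face can be handed to the communication layer as a
single contiguous buffer, even for a direction that was strided in the
lexicographic layout. A loop over positions `[0, num_interior)` touches only
interior sites and could therefore run while a non-blocking transfer of the
faces is in flight.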

Figure 4. Illustration of the old (left) and new
(right) data layout; the shaded areas show data
on the boundary which is communicated.



4. Summary 

We have introduced a flexible software toolkit 
[1] which can successfully generate optimised as- 
sembler routines for performance-critical parts of 
our lattice QCD code. On a single node we see 
a 100-150% improvement in the Wilson matrix 
solver performance at single precision and 50- 
100% at double precision. 

We demonstrate that good scaling performance
can be achieved on ALiCE if the data layout and
communication strategy are carefully adapted to
suit the communication needs.